Fast and Space-Efficient Computation of Equi-Depth Histograms for Data Streams
Abstract
Equi-depth histograms are a fundamental synopsis widely used in both database and data stream applications, where they underpin techniques such as query optimization, approximate query answering, distribution fitting, and parallel database partitioning. An equi-depth histogram partitions a sequence of data so that every part contains the same number of data items. In this paper, we present a new algorithm to estimate equi-depth histograms for high-speed data streams over sliding windows. While many previous methods were based on quantile computations, we propose a new method called BAr Splitting Histogram (BASH) that provides an expected ε-approximate solution to the equi-depth histogram problem. Extensive experiments show that BASH is at least four times faster than one of the best existing approaches, while achieving the same accuracy and using less memory. The experimental results also indicate that BASH is more stable on data affected by frequent concept shifts.
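To make the definition concrete, here is a minimal sketch of an exact, static equi-depth histogram built by sorting a finite list. It is not the BASH algorithm, which targets sliding windows over streams and is not specified in this abstract. The function names `equi_depth_boundaries` and `bucket_counts` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from typing import List, Sequence


def equi_depth_boundaries(values: Sequence[float], num_buckets: int) -> List[float]:
    """Return num_buckets - 1 upper boundaries that split the data into
    buckets holding (as nearly as possible) the same number of items.

    Illustrative static version: sorts the whole input, so it is not
    suitable for unbounded streams.
    """
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    boundaries = []
    for i in range(1, num_buckets):
        # 1-based rank of the last item that belongs to bucket i.
        rank = (i * n) // num_buckets
        boundaries.append(ordered[max(rank - 1, 0)])
    return boundaries


def bucket_counts(values: Sequence[float], boundaries: Sequence[float]) -> List[int]:
    """Count how many items fall into each bucket defined by the boundaries
    (each boundary is the inclusive upper edge of its bucket)."""
    ordered = sorted(values)
    cuts = [bisect_right(ordered, b) for b in boundaries] + [len(ordered)]
    counts = []
    previous = 0
    for cut in cuts:
        counts.append(cut - previous)
        previous = cut
    return counts
```

With distinct values whose count is a multiple of the bucket count, every bucket receives exactly the same number of items. Duplicates or other counts make the buckets only approximately equal, which is the kind of deviation an ε-approximate guarantee is meant to bound.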
Similar resources
Equi-depth Histogram Construction for Big Data with Quality Guarantees
The amount of data generated and stored in cloud systems has been increasing exponentially. The examples of data include user generated data, machine generated data as well as data crawled from the Internet. There have been several frameworks with proven efficiency to store and process the petabyte scale data such as Apache Hadoop, HDFS and several NoSQL frameworks. These systems have been wide...
Dynamic Maintenance of Wavelet-Based Histograms
In this paper, we introduce an efficient method for the dynamic maintenance of wavelet-based histograms (and other transform-based histograms). Previous work has shown that wavelet-based histograms provide more accurate selectivity estimation than traditional histograms, such as equi-depth histograms. But since wavelet-based histograms are built by a nontrivial mathematical procedure, namely, wav...
A Survey of Synopsis Construction in Data Streams
The large volume of data streams poses unique space and time constraints on the computation process. Many query processing, database operations, and mining algorithms require efficient execution which can be difficult to achieve with a fast data stream. In many cases, it may be acceptable to generate approximate solutions for such problems. In recent years a number of synopsis structures have b...
An Efficient Parallel Algorithm for High Dimensional Similarity Join - Parallel Processing Symposium, 1998, and Symposium on Parallel and Distributed Processing 1998. 19
Multidimensional similarity join finds pairs of multidimensional points that are within some small distance of each other. The ε-k-d-B tree has been proposed as a data structure that scales better as the number of dimensions increases compared to previous data structures. We present a cost model of the ε-k-d-B tree and use it to optimize the leaf size. We present novel parallel algorithms for t...